library('glmmTMB')
library('lme4')
library('ggplot2')
library('ggthemes')
library('data.table')
library('DHARMa')
library('tidyverse')
library('kableExtra')model_selection_diagnostics
Libraries
Data characteristics
The final dataset includes a row for each unique combination of census tract, time period, age group, sex, and race. The outcome variable, Cases, is the aggregate number of cases for that combination. The Pop variable corresponds to the underlying population total in that category. Because population data was not available for cross-tabluations of race and ethnicity, race and ethnicity case and population totals have been calculated separately (each race category includes both Hispanic and non-Hispanic categories. Records have been merged with the SVI data that most closely corresponds to the year of diagnosis and with the NCHS urban-rural classification codes.
Data Preparation
Pacific Islander rates are very sparse and unstable, so I’m going to combine Asian and Pacific Islander records into and Asian-PI category
dat[, 'Race' := ifelse(Race=='P', 'A', Race)]
dat<-dat[, list(Cases=sum(Cases), Pop=sum(Pop)),
by=list(GEOID, sviyear, AgeGroup, Race, State, County, RPL_THEMES, quartile, UrCode)]Factor covariates and set desired reference category.
dat[, c('Race', 'UrCode', 'quartile', 'AgeGroup') := list(
factor(Race, levels=c('W', 'B', 'A', 'I', 'M', 'O'),
labels=c('White', 'Black', 'Asian-PI', 'AI-AN', 'Multi', 'Other')),
factor(UrCode, levels=c(1:6), labels=c('lrg cen met', 'lrg frg met', 'med met', 'sml met', 'micro', 'non-core')),
factor(quartile, levels=1:4),
factor(AgeGroup, levels=c('0-4', '5-14', '15-44', '45+')))]I’m also going to create a county-level dataset to compare to the census tract level dataset. This will become important because I’m forced to drop out census tract records with zero population and tracts with more cases than population. Because the numerator (Cases) and denominator (Pop) are coming from two different data sources and because the denominator is based on a sample with an associated MOE, there are quite a few records with mismatched data in the Asian-PI, Other, and Black categories. Exclusion of these records may result in biased results. The county dataset uses a population-weighted mean RPL_THEMES, but if I decide to use counties as my unit of analysis, I can use the actual county-level SVI and population estimates.
rpl_county<-dat[, list(RPL_THEMES=sum(Pop*RPL_THEMES, na.rm=T)/sum(Pop)), by=list(County, sviyear)]
rpl_county[, quartile := cut(RPL_THEMES, breaks=c(0, .25, .5, .75, 1), include.lowest=T, labels=F)]
cntydat<-dat[,list(Cases=sum(Cases), Pop=sum(Pop)), by=list(County, sviyear, AgeGroup, Race, State, UrCode)]
cntydat<-merge(cntydat, rpl_county, by=c('County', 'sviyear'), all.x=T)
rm(rpl_county)View data
Census tract-level dataset
head(dat) GEOID sviyear AgeGroup Race State County RPL_THEMES quartile
1: 06001400100 2000 0-4 Asian-PI CA 06001 0.029 1
2: 06001400100 2000 0-4 Black CA 06001 0.029 1
3: 06001400100 2000 0-4 AI-AN CA 06001 0.029 1
4: 06001400100 2000 0-4 Multi CA 06001 0.029 1
5: 06001400100 2000 0-4 Other CA 06001 0.029 1
6: 06001400100 2000 0-4 White CA 06001 0.029 1
UrCode Cases Pop
1: lrg cen met 0 42
2: lrg cen met 0 14
3: lrg cen met 0 0
4: lrg cen met 0 26
5: lrg cen met 0 0
6: lrg cen met 0 126
County-level dataset
head(cntydat) County sviyear AgeGroup Race State UrCode Cases Pop RPL_THEMES
1: 06001 2000 0-4 Asian-PI CA lrg cen met 6 41064 0.5110268
2: 06001 2000 0-4 Black CA lrg cen met 1 29908 0.5110268
3: 06001 2000 0-4 AI-AN CA lrg cen met 0 1214 0.5110268
4: 06001 2000 0-4 Multi CA lrg cen met 0 21442 0.5110268
5: 06001 2000 0-4 Other CA lrg cen met 5 26036 0.5110268
6: 06001 2000 0-4 White CA lrg cen met 10 77092 0.5110268
quartile
1: 3
2: 3
3: 3
4: 3
5: 3
6: 3
Data quality checks
Number of records with more cases than population by race
| Race | Tract | County |
|---|---|---|
| White | 39 | 0 |
| Black | 164 | 5 |
| Asian-PI | 105 | 7 |
| AI-AN | 63 | 4 |
| Multi | 58 | 1 |
| Other | 239 | 12 |
Percent of records with zero cases by race
| Race | Tract | Tract_Percent | County | County_Percent |
|---|---|---|---|---|
| White | 200989 | 93.8% | 6074 | 62.5% |
| Black | 207903 | 97.0% | 8181 | 84.2% |
| Asian-PI | 213504 | 99.6% | 9266 | 95.4% |
| AI-AN | 214074 | 99.9% | 9536 | 98.1% |
| Multi | 214021 | 99.8% | 9466 | 97.4% |
| Other | 212877 | 99.3% | 9153 | 94.2% |
There are 668 census-tract level records with more cases than population. This may be due to sampling error in the population data, which is sourced from the ACS for sviyears after 2005 or differential classification of individuals in the ACS data compared to FoodNet. Comparitively, there were 29 county-level records with more cases than population.
Ratio of variance to mean of cases by race, and year for census-tract and county level datasets
Distribution of selected geographic units by number of cases for all race groups
These plots shows distribution of tracts and counties by number of cases for each race category for all years and age groups
I have records with zero population so including population counts as an offset (i.e., log(0)=-inf error). Therefore, I’m dropping zero population counts and including the log of pop as an offset.
ctdat<-dat[Pop>0 & Cases <= Pop & !is.na(Race), ]
cntydat<-cntydat[Pop>0 & Cases <= Pop & !is.na(Race), ]I’m creating a boolean variable so classify a tract or county having one or more cases in a specific category. This value will be used as the outcome in the logistic regression models below. I’m also going to create a variable for sviyear centered around the mean to help model fit and convergence.
ctdat$HasCases<-ifelse(ctdat$Cases>0, 1, 0)
cntydat$HasCases<-ifelse(cntydat$Cases>0, 1, 0)
ctdat$CenteredYear<-ctdat$sviyear-mean(ctdat$sviyear)
ctdat$sviyear<-factor(ctdat$sviyear)
cntydat$CenteredYear<-cntydat$sviyear-mean(cntydat$sviyear)
cntydat$sviyear<-factor(cntydat$sviyear)Bivariate associations
Percent of tracts with one or more cases by SVI quartile and Race
Percent of counties with one or more cases by SVI quartile and Race
Shigella Rate per 100K by SVI Quartile and Race among Tracts One or More Cases
Shigella Rate per 100K by SVI Quartile and Race among Counties One or More Cases
Log(Shigella Rate per 100K) by SVI Score among Tracts One or More Cases by Age Each Race Category
Log(Shigella Rate per 100K) by SVI Score among Counties One or More Cases by Age Each Race Category
It looks like the effect of SVI score (RPL_THEMES) on Shigella incidence among counties with 1+ cases might be closer to a quadratic trend. I’ll create a new variable rpl_themes2 to reflect the quadratic effect of time.
cntydat[, rpl_themes2 := RPL_THEMES^2]Multivariate Analysis
I’m going to stratify my analysis by age group
ctdat<-ctdat %>% split(.$AgeGroup)
cntydat<-cntydat %>% split(.$AgeGroup)Logistic Regression Model Compared to Binomial Portion of Hurdle Model
Starting off with a very simple approach, I’m modeling the log(odds) of a county having one or more Shigella cases using generalized linear mixed-effects model (GLMM) from the lme4 package. This model allows me to include random intercepts for State. The the glmer function produces practically equivalent results as the glmmTMB package, but slight differences are present due to the Laplace Approximation used by glmmTMB.
Here, instead of including the log(Population) as an offset term as I would if modeling a rate, I’m just controlling for it.
mod.log0<-glmer(HasCases ~ quartile + log(Pop) + (1|State), family=binomial(link='logit'), data=cntydat$`0-4`)This model is a mirror image of the binomial portion of a hurdle model. The binomial portion of the hurdle model models the log odds of a county having no cases whereas the logistic model models the log odds of a county having one or more cases. The only difference is the sign on the parameter estimates (and the extra conditional model in the hurdle model).
Equivalent hurdle model for comparison
hurdle.comp<-glmmTMB(Cases ~ quartile + offset(log(Pop)) + (1|State), ziformula= ~log(Pop) + quartile + (1|State), family=truncated_poisson, data=cntydat$`0-4`)Log(odds) and confidence intervals from logistic regression
mod.log0.estimates<-round(cbind(na.omit(confint(mod.log0, method='Wald')), fixef(mod.log0)), 4)
colnames(mod.log0.estimates)[3]<-'Estimate'
mod.log0.estimates 2.5 % 97.5 % Estimate
(Intercept) -9.8693 -8.6154 -9.2424
quartile 0.1681 0.3674 0.2677
log(Pop) 0.8470 0.9332 0.8901
Log(odds) and confidence intervals from hurdle regression
hurdle.comp.estimates<-round(confint(hurdle.comp), 4)
hurdle.comp.estimates<-hurdle.comp.estimates[grepl('zi.', row.names(hurdle.comp.estimates)),]
hurdle.comp.estimates 2.5 % 97.5 % Estimate
zi.(Intercept) 8.6150 9.8697 9.2424
zi.log(Pop) -0.9332 -0.8470 -0.8901
zi.quartile -0.3674 -0.1680 -0.2677
zi.Std.Dev.(Intercept)|State 0.3940 0.9976 0.6270
Full logistic regression model of log(odds) of having one or more cases
I will try out a few different models with RPL_THEMES (continuous SVI score) and Race as my main covariates of interest, including additional fixed and random effects.
County-Level Data Base model
mod.log1<-glmer(HasCases ~ RPL_THEMES + Race + log(Pop) + sviyear + (1|State),
family=binomial(link='logit'),
data=cntydat$`0-4`)Add Urban-Rural covariate
mod.log2<-update(mod.log1, . ~ . + UrCode)Add interaction term between race and SVI score Note, I’m updating the optimizer because this model will not converge with the interaction term.
mod.log3<-update(mod.log2, . ~ . + Race*RPL_THEMES,
control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)))Census Tract-Level Data
Base model
mod.log1.1<-glmer(HasCases ~ RPL_THEMES + Race + log(Pop) + sviyear + (1|State),
family=binomial(link='logit'),
data=ctdat$`0-4` )Add Urban-Rural covariate
mod.log2.1<-update(mod.log1.1, . ~ . + UrCode)Add interaction term between race and SVI score Note, I’m updating the optimizer because this model will not converge with the interaction term.
mod.log3.1<-update(mod.log2.1, . ~ . + Race*RPL_THEMES,
control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)))Change random effect of State to County nested within state
mod.log4.1<-update(mod.log3.1, . ~ . + (1|County:State))View Fixed Effect Odds Ratios and 95% Confidence Intervals
Note, the interpretation of race/svi parameters in Model 4 is different due to interaction term!.
Models without interaction term
Models WITH interaction term
Table of fixed effects: odds ratio (95% CI)
| Covariate | mod.log1 | mod.log1.1 | mod.log2 | mod.log2.1 | mod.log3 | mod.log3.1 | mod.log4.1 |
|---|---|---|---|---|---|---|---|
| (Intercept) | 0 (0-0) | 0 (0-0) | 0 (0-0) | 0 (0-0) | 0 (0-0) | 0 (0-0) | 0 (0-0) |
| RPL_THEMES | 3.5 (2.16-5.67) | 4.42 (3.93-4.98) | 4.67 (2.63-8.28) | 4.09 (3.63-4.62) | 12.82 (6.22-26.41) | 5.02 (4.29-5.88) | 4.84 (4.12-5.69) |
| RPL_THEMES:RaceAI-AN | NA | NA | NA | NA | 0.35 (0.04-2.9) | 1.83 (0.57-5.85) | 1.08 (0.35-3.32) |
| RPL_THEMES:RaceAsian-PI | NA | NA | NA | NA | 0.12 (0.03-0.57) | 0.27 (0.16-0.47) | 0.25 (0.15-0.44) |
| RPL_THEMES:RaceBlack | NA | NA | NA | NA | 0.09 (0.04-0.23) | 0.66 (0.5-0.86) | 0.61 (0.46-0.8) |
| RPL_THEMES:RaceMulti | NA | NA | NA | NA | 0.34 (0.07-1.63) | 0.47 (0.23-0.95) | 0.44 (0.22-0.88) |
| RPL_THEMES:RaceOther | NA | NA | NA | NA | 0.54 (0.16-1.8) | 0.87 (0.5-1.51) | 0.77 (0.44-1.34) |
| RaceAI-AN | 1.39 (0.96-2.02) | 0.73 (0.56-0.94) | 1.09 (0.74-1.62) | 0.74 (0.57-0.96) | 2.17 (0.61-7.66) | 0.46 (0.18-1.18) | 0.68 (0.28-1.65) |
| RaceAsian-PI | 1.68 (1.28-2.19) | 0.59 (0.5-0.7) | 1.36 (1.02-1.81) | 0.6 (0.5-0.71) | 4.24 (1.96-9.14) | 1.19 (0.87-1.62) | 1.22 (0.9-1.66) |
| RaceBlack | 1.85 (1.55-2.21) | 1.49 (1.39-1.6) | 1.68 (1.4-2.02) | 1.46 (1.35-1.57) | 6.95 (3.88-12.42) | 1.91 (1.57-2.32) | 1.95 (1.6-2.38) |
| RaceMulti | 0.42 (0.33-0.55) | 0.21 (0.17-0.25) | 0.35 (0.27-0.47) | 0.21 (0.17-0.25) | 0.69 (0.29-1.64) | 0.33 (0.21-0.52) | 0.34 (0.22-0.53) |
| RaceOther | 1.49 (1.19-1.88) | 0.63 (0.55-0.72) | 1.25 (0.98-1.6) | 0.64 (0.56-0.73) | 1.89 (0.9-3.96) | 0.69 (0.45-1.07) | 0.76 (0.5-1.16) |
| UrCodelrg frg met | NA | NA | 0.65 (0.47-0.89) | 0.66 (0.6-0.73) | 0.67 (0.49-0.92) | 0.66 (0.6-0.72) | 0.58 (0.39-0.88) |
| UrCodemed met | NA | NA | 0.91 (0.65-1.27) | 1.05 (0.95-1.17) | 0.97 (0.69-1.35) | 1.05 (0.95-1.17) | 0.71 (0.46-1.09) |
| UrCodemicro | NA | NA | 0.61 (0.43-0.86) | 0.95 (0.85-1.06) | 0.66 (0.46-0.93) | 0.94 (0.84-1.05) | 0.59 (0.39-0.9) |
| UrCodenon-core | NA | NA | 0.48 (0.33-0.7) | 0.65 (0.56-0.74) | 0.53 (0.37-0.78) | 0.64 (0.56-0.74) | 0.42 (0.28-0.65) |
| UrCodesml met | NA | NA | 0.7 (0.5-0.99) | 0.99 (0.89-1.11) | 0.75 (0.53-1.07) | 0.99 (0.88-1.1) | 0.6 (0.39-0.93) |
| log(Pop) | 2.56 (2.42-2.7) | 1.93 (1.87-2) | 2.43 (2.28-2.58) | 1.95 (1.89-2.02) | 2.5 (2.34-2.66) | 1.97 (1.9-2.04) | 2.01 (1.95-2.09) |
| sviyear2010 | 0.9 (0.73-1.11) | 1.25 (1.13-1.39) | 0.96 (0.77-1.18) | 1.26 (1.13-1.4) | 0.93 (0.75-1.15) | 1.24 (1.12-1.38) | 1.24 (1.11-1.37) |
| sviyear2014 | 1.18 (0.96-1.46) | 1.42 (1.28-1.57) | 1.24 (1-1.53) | 1.42 (1.28-1.58) | 1.21 (0.98-1.49) | 1.41 (1.27-1.56) | 1.4 (1.27-1.56) |
| sviyear2016 | 1.74 (1.41-2.15) | 1.38 (1.24-1.53) | 1.75 (1.42-2.16) | 1.39 (1.25-1.55) | 1.74 (1.41-2.15) | 1.38 (1.24-1.54) | 1.38 (1.24-1.54) |
| sviyear2018 | 0.85 (0.68-1.05) | 0.81 (0.72-0.91) | 0.87 (0.7-1.08) | 0.81 (0.72-0.91) | 0.85 (0.69-1.06) | 0.8 (0.72-0.9) | 0.79 (0.71-0.89) |
Model Fit Comparison and Diagnostics
County-Level Models
npar AIC BIC logLik deviance
mod.log1 13 6210.6 6306.4 -3092.3 6184.6
mod.log2 18 6190.0 6322.7 -3077.0 6154.0
mod.log3 23 6171.0 6340.5 -3062.5 6125.0
Census-Tract Level Models
npar AIC BIC logLik deviance
mod.log1.1 13 35034 35165 -17504 35008
mod.log2.1 18 34890 35072 -17427 34854
mod.log3.1 23 34869 35101 -17412 34823
mod.log4.1 24 34186 34427 -17069 34138
Evaluate quantile residuals from full logistic model
Quantile residuals have been shown to be more appropriate for GLMM model diagnostics. The cumulative probability that an observation is less than or equal to the fitted value of a given distribution is determined and then used to find the corresponding value of the standard normal variate (i.e., quantile residual).
I’m going to test output from Model #2 (County-level model without interaction).
Model #4 appears to have the best fit among the census-tract level models (based on the AIC) but answers a slightly different question than the original research question (what is the effect of race after accounting for SVI on the incidence of Shigella?). Due to the inclusion of the interaction between race and SVI, Model #4 shows what the impact of SVI is for different race categories.
Simulate Residuals for Model #2
mod2simout<-simulateResiduals(fittedModel = mod.log2, plot=F)View simulated residuals
plot(mod2simout)The Kolmogorov-Smirnov test shows a poor goodness of fit in Model #2, however this test is sensitive to Type I errors and because I have a large amount of data (n=11699), this result might not be too much of an issue. The QQ is nearly linear, suggesting that the overall distribution is acceptable. from I’ll go through some more diagnostics to evaluate if there are any obvious problems with the fit of Model #2.
Non-parametric test for overdispersion
testDispersion(mod2simout)
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 1.1414, p-value = 0.28
alternative hypothesis: two.sided
Test for zero-inflation
testZeroInflation(mod2simout)
DHARMa zero-inflation test via comparison to expected zeros with
simulation under H0 = fitted model
data: simulationOutput
ratioObsSim = 0.97484, p-value = 0.288
alternative hypothesis: two.sided
Test for outliers
testOutliers(mod.log2)
DHARMa outlier test based on exact binomial test with approximate
expectations
data: mod.log2
outliers at both margin(s) = 108, observations = 11699, p-value =
0.1311
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
0.00757880 0.01113494
sample estimates:
frequency of outliers (expected: 0.00796812749003984 )
0.009231558
There is no significant difference in the observed frequency of outliers compared to expected.
| Outlier Type | |
|---|---|
| Observed | 104 |
| Expected | 93 |
View simulated residuals plotted against covariates in Model #2
plotResiduals(mod2simout, form = na.omit(cntydat$`0-4`$RPL_THEMES))plotResiduals(mod2simout, form = na.omit(cntydat$`0-4`)$Race)plotResiduals(mod2simout, form = na.omit(cntydat$`0-4`)$sviyear)plotResiduals(mod2simout, form = na.omit(cntydat$`0-4`)$UrCode)plotResiduals(mod2simout, form = with(na.omit(cntydat$`0-4`), Pop))Hurdle Model (County-Level Data)
I will create a list of hurdle poisson and hurdle negative binomial models to compare fit. Each model includes the log of population as an offset term and a random intercept for state and/or county nested in state in the conditional portion of the model. Different covariates are evaluated in both the conditional and the zero-inflation (binomial) portion of the model.
###Truncated Poisson Models
Null Model
modhur0<-glmmTMB(Cases ~ offset(log(Pop)) + (1|State), zi= ~1, family=truncated_poisson, data=cntydat$`0-4`)Base Model, includes essential covariates
modhur1<-update(modhur0, . ~ . + Race + RPL_THEMES + sviyear)Add urban-rural designation
modhur2<-update(modhur1, . ~ . + UrCode)Add quadratic trend of RPL_THEME
modhur3<-update(modhur2, . ~ . + (rpl_themes2))#Compare models
hurdle.poisson.comparison<-anova(modhur0, modhur1, modhur2, modhur3)
hurdle.poisson.comparison[,c(1:5, 8)] %>% arrange(desc(AIC)) Df AIC BIC logLik deviance Pr(>Chisq)
modhur0 3 19401 19423 -9697.6 19395
modhur1 13 17891 17987 -8932.6 17865 < 2e-16 ***
modhur2 18 17732 17864 -8847.9 17696 < 2e-16 ***
modhur3 19 17731 17871 -8846.4 17693 0.08349 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Add covariates to binomial portion of the model It doesn’t appear that the quadratic effect of RPL_THEMES adds anything to the model, so I’m dropping out that term.
modhur3<-update(modhur2, ziformula = ~ . )###Truncated negative binomial distribution
modhur0.1<-update(modhur0, family=truncated_nbinom2)
modhur1.1<-update(modhur1, family=truncated_nbinom2)
modhur2.1<-update(modhur2, family=truncated_nbinom2)
modhur3.1<-update(modhur3, family=truncated_nbinom2)Hurdle model comparison
| Conditional Formula | ZI Formula | Distribution | AIC | Covergence | DF | Dispersion Parameter |
|---|---|---|---|---|---|---|
| Cases ~ offset(log(Pop)) + (1 | State) | 1 | truncated_poisson | 19401.19 | Yes | 3 | 1.0000000 |
| Cases ~ offset(log(Pop)) + (1 | State) | 1 | truncated_nbinom2 | 16169.63 | Yes | 4 | 0.3671244 |
| Cases ~ (1 | State) + Race + RPL_THEMES + sviyear + offset(log(Pop)) | 1 | truncated_poisson | 17891.15 | Yes | 13 | 1.0000000 |
| Cases ~ (1 | State) + Race + RPL_THEMES + sviyear + offset(log(Pop)) | 1 | truncated_nbinom2 | 15941.57 | Yes | 14 | 0.8542466 |
| Cases ~ (1 | State) + Race + RPL_THEMES + sviyear + UrCode + offset(log(Pop)) | 1 | truncated_poisson | 17731.83 | Yes | 18 | 1.0000000 |
| Cases ~ (1 | State) + Race + RPL_THEMES + sviyear + UrCode + offset(log(Pop)) | 1 | truncated_nbinom2 | 15939.84 | Yes | 19 | 0.8946618 |
| Cases ~ (1 | State) + Race + RPL_THEMES + sviyear + UrCode + offset(log(Pop)) | (1 | State) + Race + RPL_THEMES + sviyear + UrCode | truncated_poisson | 15207.73 | Yes | 34 | 1.0000000 |
| Cases ~ (1 | State) + Race + RPL_THEMES + sviyear + UrCode + offset(log(Pop)) | (1 | State) + Race + RPL_THEMES + sviyear + UrCode | truncated_nbinom2 | 13415.74 | Yes | 35 | 0.8946559 |
Extract model with best overall fit
final_mod<-models[[which.min(sapply(1:length(models),function(x)AIC(models[[x]])))]]Final Model Summary
summary(final_mod) Family: truncated_nbinom2 ( log )
Formula: Cases ~ (1 | State) + Race + RPL_THEMES + sviyear + UrCode +
offset(log(Pop))
Zero inflation: ~.
Data: cntydat$`0-4`
AIC BIC logLik deviance df.resid
13415.7 13673.6 -6672.9 13345.7 11664
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
State (Intercept) 0.623 0.7893
Number of obs: 11699, groups: State, 10
Zero-inflation model:
Groups Name Variance Std.Dev.
State (Intercept) 0.4914 0.701
Number of obs: 11699, groups: State, 10
Dispersion parameter for truncated_nbinom2 family (): 0.895
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.80933 0.33093 -29.642 < 2e-16 ***
RaceBlack 0.82618 0.08007 10.318 < 2e-16 ***
RaceAsian-PI 0.04081 0.18447 0.221 0.824915
RaceAI-AN 1.05289 0.29136 3.614 0.000302 ***
RaceMulti -0.12911 0.21783 -0.593 0.553359
RaceOther 1.25231 0.13677 9.156 < 2e-16 ***
RPL_THEMES 1.54030 0.33204 4.639 3.50e-06 ***
sviyear2010 -0.37077 0.11372 -3.260 0.001112 **
sviyear2014 -0.50732 0.11489 -4.416 1.01e-05 ***
sviyear2016 -0.14750 0.12475 -1.182 0.237033
sviyear2018 -0.80049 0.13215 -6.057 1.38e-09 ***
UrCodelrg frg met -0.19967 0.12710 -1.571 0.116178
UrCodemed met 0.05366 0.14435 0.372 0.710096
UrCodesml met 0.07126 0.14953 0.477 0.633675
UrCodemicro 0.23843 0.15083 1.581 0.113914
UrCodenon-core 0.01036 0.17225 0.060 0.952049
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.36073 0.29602 -4.597 4.29e-06 ***
RaceBlack 0.89433 0.07551 11.845 < 2e-16 ***
RaceAsian-PI 2.31655 0.11263 20.568 < 2e-16 ***
RaceAI-AN 3.34070 0.17441 19.154 < 2e-16 ***
RaceMulti 3.04576 0.12675 24.029 < 2e-16 ***
RaceOther 2.07071 0.09908 20.899 < 2e-16 ***
RPL_THEMES -0.22986 0.25180 -0.913 0.36133
sviyear2010 -0.85910 0.09795 -8.771 < 2e-16 ***
sviyear2014 -0.86220 0.09720 -8.871 < 2e-16 ***
sviyear2016 -0.62975 0.09956 -6.325 2.53e-10 ***
sviyear2018 -0.30792 0.10285 -2.994 0.00275 **
UrCodelrg frg met 2.03003 0.14596 13.908 < 2e-16 ***
UrCodemed met 2.00529 0.15319 13.090 < 2e-16 ***
UrCodesml met 2.50446 0.15693 15.959 < 2e-16 ***
UrCodemicro 2.77952 0.15325 18.137 < 2e-16 ***
UrCodenon-core 3.67074 0.16002 22.939 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I looks like I could probably drop the urban-rural designation from the conditional portion of the model and RPL_THEMES (linear and quadratic) from the binomial portion of the model.
Visualize simulated residuals from best fit model.
final_mod<-update(final_mod, . ~ . -UrCode, ziformula = ~ Race + sviyear + UrCode)
summary(final_mod) Family: truncated_nbinom2 ( log )
Formula:
Cases ~ (1 | State) + Race + RPL_THEMES + sviyear + offset(log(Pop))
Zero inflation: ~Race + sviyear + UrCode
Data: cntydat$`0-4`
AIC BIC logLik deviance df.resid
13757.9 13964.2 -6850.9 13701.9 11671
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
State (Intercept) 0.6177 0.7859
Number of obs: 11699, groups: State, 10
Dispersion parameter for truncated_nbinom2 family (): 0.854
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.03369 0.29856 -33.61 < 2e-16 ***
RaceBlack 0.83274 0.08002 10.41 < 2e-16 ***
RaceAsian-PI 0.01231 0.18446 0.07 0.946798
RaceAI-AN 1.02626 0.29336 3.50 0.000468 ***
RaceMulti -0.15835 0.21761 -0.73 0.466821
RaceOther 1.25252 0.13841 9.05 < 2e-16 ***
RPL_THEMES 1.97767 0.25811 7.66 1.83e-14 ***
sviyear2010 -0.34589 0.11458 -3.02 0.002538 **
sviyear2014 -0.53601 0.11529 -4.65 3.33e-06 ***
sviyear2016 -0.17320 0.12544 -1.38 0.167347
sviyear2018 -0.82201 0.13246 -6.21 5.45e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.40412 0.14541 -9.656 < 2e-16 ***
RaceBlack 0.80737 0.07316 11.036 < 2e-16 ***
RaceAsian-PI 2.24654 0.10939 20.536 < 2e-16 ***
RaceAI-AN 3.26317 0.16826 19.394 < 2e-16 ***
RaceMulti 2.91850 0.12346 23.639 < 2e-16 ***
RaceOther 1.97697 0.09632 20.525 < 2e-16 ***
sviyear2010 -0.82224 0.09547 -8.613 < 2e-16 ***
sviyear2014 -0.82331 0.09476 -8.689 < 2e-16 ***
sviyear2016 -0.60128 0.09708 -6.193 5.89e-10 ***
sviyear2018 -0.29201 0.10032 -2.911 0.0036 **
UrCodelrg frg met 1.86586 0.13293 14.036 < 2e-16 ***
UrCodemed met 1.94684 0.14135 13.774 < 2e-16 ***
UrCodesml met 2.26486 0.14189 15.962 < 2e-16 ***
UrCodemicro 2.72396 0.13795 19.746 < 2e-16 ***
UrCodenon-core 3.43738 0.14241 24.138 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
simulationOutput <- simulateResiduals(fittedModel = final_mod, plot=F)
plot(simulationOutput)DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
testDispersion(simulationOutput = simulationOutput, alternative ="less")
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.99515, p-value = 0.656
alternative hypothesis: less
testOutliers(simulationOutput)DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
DHARMa outlier test based on exact binomial test with approximate
expectations
data: simulationOutput
outliers at both margin(s) = 143, observations = 11699, p-value =
1.547e-06
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
0.01031151 0.01438313
sample estimates:
frequency of outliers (expected: 0.00796812749003984 )
0.01222327
Outliers seem to be flagged as a problem for this model. The DHARMa documentation indicates that binomial models may have an inflated Type I error rate and to re-run the outlier test with bootstrap to achieve a more exact result.
testOutliers(simulationOutput, type='bootstrap')
DHARMa bootstrapped outlier test
data: simulationOutput
outliers at both margin(s) = 54, observations = 11699, p-value = 0.22
alternative hypothesis: two.sided
percent confidence interval:
0.001066330 0.007707924
sample estimates:
outlier frequency (expected: 0.00290794084964527 )
0.004615779
Plot residuals against covariates in model
plotResiduals(simulationOutput, form = na.omit(cntydat$`0-4`$RPL_THEMES))plotResiduals(simulationOutput, form = na.omit(cntydat$`0-4`$Race))plotResiduals(simulationOutput, form = na.omit(cntydat$`0-4`$sviyear))plotResiduals(simulationOutput, form = na.omit(cntydat$`0-4`$UrCode))plotResiduals(simulationOutput, form = na.omit(cntydat$`0-4`$Pop))